Abstract—We study the efficient regular expression (regex) matching problem. Existing algorithms are scanning-based:
they typically use an equivalent automaton compiled from the regex query to verify a document. Although some works propose
various strategies to quickly jump to candidate locations in a document where a query result may appear, they still need a
scanning-based method to verify these candidate locations. These methods become inefficient when many candidate
locations remain to be verified.

In this paper, we propose a novel approach to efficiently compute all matching positions for a regex query purely based on a positional
q-gram inverted index. We propose a gram-driven NFA (GNFA) to represent the language of a regex and show that all regex matching
locations can be obtained by finding positions of the q-grams of the GNFA that satisfy certain positional constraints. We then propose
several GNFA-based query plans to answer the query using the positional inverted index. To improve query efficiency, we design an
algorithm that builds a tree-based query plan by carefully choosing a checking order for positional constraints. Experimental results on
real-world datasets show that our method outperforms state-of-the-art methods by up to an order of magnitude in query efficiency.


Index Terms—Regular Expression Matching, Positional Inverted Index, Query Plan. 





1 INTRODUCTION 


We study the regular expression (regex) matching problem, which
aims to find all matching positions of a regex query on a document.
Regex matching plays an essential role in many applications, such
as information extraction [1], [2], entity matching, protein
sequence matching [4], and intrusion detection [5], [6], [7].

Although the problem has been actively studied in the fields of
databases and pattern matching, more efficient regex matching is
still in strong demand for some applications, as the following
examples show.


• Regular expressions are widely adopted for packet content
scanning, in which all protocol identifiers are represented as
regular expressions in the Linux Application Protocol Classifier
(L7-filter), and over 90% of the CPU time is spent in
regex matching when all 70 protocol filters are enabled in the
L7-filter [8].


• ClamAV is a popular open-source anti-virus scanner, which
monitors the stream data by matching it against a virus signature
library. A virus is reported if any signature is matched on the data.
ClamAV contains 545,191 virus signatures, of which 16.49% are
represented as regular expressions, but 99.3% of the matching time
is spent on these regular expressions [9].

Research in this area mainly uses the filter-verify schema
to solve the regex matching problem. These methods first quickly
identify candidate locations in a document where the corresponding
substring may match the regex query, by using substrings extracted
from the regex query. Then they compile an equivalent automaton
from the regex query, where each transition between two states
in the automaton represents a character in the query, and run the
automaton to check whether each candidate position can answer
the query. MultiStringRE computes a set of prefixes for
all strings matching the regex query, then uses a
Commentz-Walter-like algorithm to verify the document starting from
each matching substring of these prefixes. GNU grep utilizes the
necessary factors, i.e., substrings that must appear in every string
matching the regex, to get candidate locations. Since a necessary
factor divides a regex into a left and a right part, two automata are
constructed to verify a candidate location in both directions. NRGrep
gets the candidate locations using the reversed prefixes of the regex
and then verifies them. Most recently, [12] proposes the N-Factor, a
substring that cannot exist in the matching strings of a regex query,
to further prune the candidate locations obtained from the above
substrings.

Recent methods also answer a regex query on a set of documents.
They leverage a Boolean expression computed from the
regex to locate candidate documents that may contain a query
result from a document collection, using a document-level q-gram
based inverted index [2], [3], [14]. Then they run an automaton
to locate the answer substrings by examining every suffix in each
candidate document.

All the existing methods follow the filter-verify schema, and
they use the index and the automaton separately. The index is
used to locate candidate positions in the filtering stage, then the
automaton is used as a black box to verify these positions. The
filtering aims to generate as few candidate positions as possible
because the performance of verification depends on the number of
candidates. The matching performance is poor when many candidate
positions still need to be verified. This raises the question:
can we break the black box of automaton verification
with purely index-based matching? If so, all matching positions can
be directly computed through the index and the regex matching
efficiency can be improved.

In this paper, to combine filtering and verification into one step,
we design a gram-driven NFA (GNFA) that builds on the widely
used q-gram inverted index, where each transition is labeled by
a q-gram rather than a character. The GNFA can represent the
positional relationships among the matching q-grams required by a
regex matching string (called positional constraints), which helps us
to directly find the regex occurrences through the q-gram positions
obtained from the inverted index. A good query plan based
on the GNFA can then be obtained by carefully selecting the checking
order for the q-gram positions. Also, since the q-gram inverted
index is widely adopted by many search and storage engines, e.g.,
Lucene and Elasticsearch [16], our proposed method can be
easily integrated into these systems without rebuilding the indices.

The contributions of this paper are listed as follows. 


• We propose a gram-driven NFA (GNFA) to represent the
language of a regex in Section 3. To the best of our knowledge,
our work is the first that combines the two steps, filtering
and verification, into one step by using our proposed GNFA and
the q-gram inverted index. The GNFA is used to find a good query
plan, which converts regex matching into checking the gram
positions from inverted lists.

• We devise the GNFA-based query plan in Section 4, which
computes the regex occurrences by checking the positional
constraints between the adjacent q-grams in the GNFA. We design
optimization techniques to avoid unnecessary checks for
positional constraints in a query plan.

• By considering the selectivity of q-grams, we further propose
a tree-based query plan to improve the checking order for
positional constraints in Section 5. We show that the checks for
positional constraints can be represented by a binary plan tree;
a method for building the plan tree from the GNFA is then
developed by carefully choosing a checking order for positional
constraints using the selectivity of the related q-grams.

• We conduct experiments using real datasets and demonstrate
that the matching efficiency of our methods outperforms previous
state-of-the-art methods by up to an order of magnitude in
Section 7.


2 PRELIMINARIES

2.1 Problem Formulation

In this paper, we follow the definition of a regular expression
(regex) in [12]. Let $\Sigma$ be a finite alphabet; a regular expression
can be defined recursively as follows.


• Each string $s \in \Sigma^*$ is a regular expression, which denotes the
string set $\{s\}$.

• $(e_1)^+$ is a regular expression that denotes the set of strings $x$
such that, for a positive integer $k$, $x$ can be written as
$x = x_1 \cdots x_k$ and $e_1$ matches each string $x_i$ ($1 \le i \le k$).

• $(e_1|e_2)$ is a regular expression that denotes the set of strings $x$
such that $x$ matches $e_1$ or $e_2$.

• $(e_1 \cdot e_2)$ is a regular expression that denotes the set of strings $x$
that can be written as $x = x_1 \cdot x_2$, where $e_1$ and $e_2$ match $x_1$ and
$x_2$, respectively, and $\cdot$ denotes string concatenation.


Note that other syntactic sugar of regex can be represented by the
above definition [17]. For example, we can represent the Kleene
closure $e^*$ as $(\varepsilon|e^+)$ and the optional unit $(e)?$ as $(\varepsilon|e)$. In this
paper, we say $(e)^*$ and $(e)^+$ are the repeating units of a regex.

The set of strings that match the regex $Q$ is called the language
of $Q$ (denoted by $L(Q)$). The minimal matching length ($l_{min}$) of
$Q$ is the minimal length of the strings in $L(Q)$. Example 1 shows a
regex example, where spaces are explicitly represented by ␣.


Example 1. Consider $Q$ = su(␣)*ch(ow|f)n. It matches
strings which start with the Linux command su and end
with the commands chown or chfn. The language $L(Q)$ =
{suchfn, suchown, su␣chfn, su␣chown, su␣␣chfn,
su␣␣chown, ···} and $l_{min}$ = 6.
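
As a quick sanity check of Example 1, the following sketch uses Python's standard `re` module as a scanning-based baseline. It is only an illustration: a plain space stands in for the explicit space symbol, and the helper name `occurrences` is hypothetical, not part of the paper.

```python
import re

# Regex of Example 1; a plain space replaces the explicit space symbol.
Q = re.compile(r"su( )*ch(ow|f)n")

# The strings listed in Example 1 all belong to L(Q).
listed = ["suchfn", "suchown", "su chfn", "su chown", "su  chfn", "su  chown"]
assert all(Q.fullmatch(s) for s in listed)

# The shortest listed string is "suchfn", which has length 6.
assert min(len(s) for s in listed) == 6

def occurrences(doc):
    """Scan every start position and return the (start, end) span of each occurrence of Q."""
    spans = []
    for start in range(len(doc)):
        m = Q.match(doc, start)  # anchored match attempt at this start position
        if m:
            spans.append((start, m.end()))
    return spans

doc = "run su  chown now; suchfn"
print([doc[s:e] for s, e in occurrences(doc)])  # ['su  chown', 'suchfn']
```

This is the character-by-character scanning behaviour that the index-based approach aims to avoid; it is useful here mainly as a reference result.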


A substring of the document that matches any string in $L(Q)$
is called an occurrence of $Q$. The regex matching problem is to
find all occurrences of the regex query in the document.

A classical filtering-based approach is to locate the matching
substrings of prefixes of a regex on the document as candidates,
then run the corresponding automaton to verify them. For the above
example, all strings in $L(Q)$ start with the prefix su,
so the matching substrings of su are the candidates for regex
occurrences. The widely used q-gram based inverted index can
be adopted to efficiently find such candidate positions.


q-gram Based Inverted Index. A q-gram of a string is a substring
of length q that can be used as a signature for the string. For
instance, the 2-grams of the string chown are ch, ho, ow and wn.
We can decompose a document into q-grams using a sliding
window of length q, and build a positional inverted list for each
q-gram (a.k.a. positional gram list) to record the starting positions
where the gram appears.
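
The construction of the positional gram lists can be sketched as follows. This is a minimal in-memory illustration, assuming 0-based positions and a Python dictionary as the index; the function name `build_positional_index` is hypothetical, and a production index (e.g., one kept by a search engine) would be stored differently.

```python
from collections import defaultdict

def build_positional_index(doc, q=2):
    """Map each q-gram of doc to the sorted list of its starting positions (0-based)."""
    index = defaultdict(list)
    for pos in range(len(doc) - q + 1):
        # The window slides left to right, so each list is built in ascending order.
        index[doc[pos:pos + q]].append(pos)
    return dict(index)

print(build_positional_index("chown", q=2))
# {'ch': [0], 'ho': [1], 'ow': [2], 'wn': [3]}
```

Because every list is sorted, later steps can look up a position with binary or gallop search instead of scanning the list.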


3  GRAM-DRIVEN NFA 


The characters in a document are not independent: there is a
high probability that a character is followed by one of a few specific
characters (e.g., the substring java frequently appears in a
code document). Existing methods typically use an automaton
whose transitions are labeled by characters to verify the substrings
of the document. In this way, the characters of the document
are checked sequentially and the dependency among characters is
not considered. In this section, we propose a gram-driven NFA
(GNFA), in which each transition is labeled by a q-gram. We
show that all occurrences can be computed by checking the positions
of the q-grams in the GNFA, which can be done using the corresponding
q-gram inverted lists.


3.1 Gram-driven NFA 


For any regex, a Thompson NFA can be constructed to accept
the matching substrings of the regex. We regard it as a
character-driven NFA since the transitions are all based on single
characters. In order to utilize the q-gram based inverted index, we
generalize the character-driven NFA to a gram-driven NFA.


Definition 1. (Gram-driven NFA) A gram-driven NFA (GNFA)
$A_g = (S_v, S_e, Z, F, D)$ represents a set of gram sequences,
where $S_v$ is a set of states (also called nodes) that contains
one initial node $Z$ and one final node $F$. $S_e$ is the set of
all q-grams derived from the matching strings of the regex $Q$.
Transitions between nodes are labeled by elements of $S_e$. $D$ is
the transition function that describes the transition information
between nodes.

(b) Gram-driven NFA $A_g$ for regex $Q$.


Fig. 1. Character-driven NFA vs gram-driven NFA. 


We use $A_c$ and $A_g$ to denote the character-driven NFA and the
gram-driven NFA, respectively. Fig. 1 shows the Thompson NFA
$A_c$ and its corresponding GNFA $A_g$ for the running regex $Q$.

Consider a transition from a node $V_i$ to another node $V_j$
through a gram $g$ in our GNFA. We say $g$ is a transition gram, and
$V_i$ and $V_j$ are the start and end node of $g$, respectively. Two transition
grams are called adjacent transition grams if one gram's end node
is the other's start node, e.g., $g_1$ and $g_5$ in Fig. 1(b).


Theorem 1. Given a regex $Q$, the number of q-grams in the GNFA
of $Q$ is bounded by $n m^{q-1}$, where $n$ and $m$ are the numbers
of characters and distinct characters in $Q$, respectively.


For any character $c$ in $Q$, there are at most $m^{q-1}$ q-grams
that start with $c$, so the number of q-grams computed from $Q$ is
bounded by $n m^{q-1}$, which is also the bound on the size of $A_g$. We
introduce the construction method of $A_g$ in Sec.

We define a path of $A_g$ as a sequence of transitions (hence
including nodes and transition grams) that leads from the initial
node to the final node of $A_g$. Adjacent transition grams on a path
$P$ share $q - 1$ common characters. Hence, we can reconstruct the
string accepted by $P$ by starting with the first transition gram and
then appending the last character of each following transition
gram in $P$. We use $str(P)$ to denote the set of corresponding
strings constructed by $P$. For example, for the path $P$ = V0-V;-V3-V4-V7-Vg
in Fig. 1(b), there must exist a string $s$ in $str(P)$
such that $s$ = suchfn.
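
The reconstruction rule (first gram, then the last character of every following gram) can be sketched as below. The gram labels su, uc, ch, hf, fn are an assumption derived from the running regex with q = 2, not read from the figure, and `path_to_string` is a hypothetical helper name.

```python
def path_to_string(grams):
    """Rebuild the string spelled by a sequence of adjacent q-grams on one path."""
    for prev, nxt in zip(grams, grams[1:]):
        # Adjacent transition grams must overlap in q - 1 characters.
        if prev[1:] != nxt[:-1]:
            raise ValueError(f"{prev!r} and {nxt!r} are not adjacent grams")
    # First gram in full, then only the last character of each later gram.
    return grams[0] + "".join(g[-1] for g in grams[1:])

print(path_to_string(["su", "uc", "ch", "hf", "fn"]))  # suchfn
```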

Theorem 2. Given a regex $Q$, the GNFA $A_g$ for $Q$ accepts exactly
the language $L(Q)$ iff $A_g$ satisfies the following properties:


• for any path $P$ in $A_g$, every $s \in str(P)$ belongs to the language
$L(Q)$; and

• for any string $s$ in $L(Q)$, there must exist a path $P$ in $A_g$ such
that $s \in str(P)$, i.e., $s$ can be constructed by $P$.


3.2 Positional Constraints based on GNFA 


From the GNFA, we can find that, for any matching occurrence, 
there exist certain positional relationships among the related 
matching q-grams, e.g., the difference between the matching 
positions of su and uc in an occurrence is 1. Such positional 
relationships help us to find matching occurrences through the 
positional inverted index. 

For any transition gram $g_i$, by looking up the positional q-gram
inverted index, we can get an ordered list (gram list) $l(g_i)$
of matching positions where $g_i$ occurs. Fig. 2 shows an example
of positional q-gram lists for the transition grams in Fig. 1(b).

l(g1) | l(g2) | l(g3) | l(g4) | l(g5) | l(g6) | l(g7) | l(g8) | l(g9) | l(g10) | l(g11)
11    | 27    | 1     | 16    | 12    | 17    | 41    | 4     | 5     | 18     | 8
35    | 65    | 28    | 51    | 36    | 37    | 60    | 61    | 43    | 38     | 21
87    | 104   | 82    | 66    | 58    | 59    | 74    | 75    | 93    | 78     | 39
103   | 199   | 83    | 72    | 69    | 73    | 145   | 184   | 185   | 94     | 79

Further entries of the longer lists (column alignment lost): 198, 125, 105, 88, 239, 252, 174; 161, 121, 205, 223; 217, 235, 275.

Fig. 2. An example of positional q-gram lists for the grams in a GNFA.

Consider any path $P$ on a GNFA $A_g$ and let $s$ be a matching
substring in $str(P)$. Intuitively, for any two adjacent grams $g_i$ and
$g_{i+1}$ in $P$, there must exist a pair of matching positions $(\pi_i, \pi_{i+1})$
such that $\pi_i$ and $\pi_{i+1}$ are matching positions of $g_i$ and $g_{i+1}$ on $s$,
respectively, and $\pi_{i+1} - \pi_i = 1$.

Theorem 3 shows the position difference between any two matching
grams in an occurrence $s$ for $A_g$.


Theorem 3. Given a path $P$ on a GNFA $A_g$, let $g_1, \ldots, g_k$ be the
first, ..., the last transition grams in $P$. A matching substring
$s \in str(P)$ exists iff there exists a set of matching positions
$\pi_1, \ldots, \pi_k$ of $g_1, \ldots, g_k$ such that for any pair of grams $g_i$
and $g_j$, $\pi_j - \pi_i = j - i$.


We use $C(\pi_i, \pi_j)$ to denote the positional constraint (i.e.,
$\pi_j - \pi_i = j - i$) for the matching positions $\pi_i$ and $\pi_j$, and
say $j - i$ is the positional offset of the two transition grams.

A set of positions that satisfies the positional constraints
required by an occurrence is called a valid position combination
of transition grams. For example, $g_1$, $g_5$, $g_6$, $g_{10}$ and $g_{11}$ are
the 1st, 2nd, ..., 5th transition grams of a path in Fig. 1(b);
{35, 36, 37, 38, 39} is a valid position combination and there is
an occurrence of suchfn starting at position 35. Therefore, given
the positions of transition grams, we can find all occurrences by
computing all valid position combinations.
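
To make the notion of a valid position combination concrete, here is a deliberately naive sketch that tests every position of the first gram against the adjacent-offset condition. The lists are the first four entries of the lists of g1, g5, g6, g10 and g11 as they appear in Fig. 2 (longer lists are truncated, which is an assumption of this example), and `valid_combinations` is a hypothetical helper name.

```python
def valid_combinations(lists):
    """lists[i] holds the sorted positions of the (i+1)-th transition gram of one path.
    Return every combination [p, p+1, ..., p+k-1] with p+i contained in lists[i]."""
    sets = [set(positions) for positions in lists]
    result = []
    for p in lists[0]:
        # Consecutive grams on a path must sit at consecutive positions.
        if all(p + i in s for i, s in enumerate(sets)):
            result.append([p + i for i in range(len(lists))])
    return result

path_lists = [
    [11, 35, 87, 103],  # g1
    [12, 36, 58, 69],   # g5
    [17, 37, 59, 73],   # g6
    [18, 38, 78, 94],   # g10
    [8, 21, 39, 79],    # g11
]
print(valid_combinations(path_lists))  # [[35, 36, 37, 38, 39]]
```

This brute-force check touches every position of the first list; the query plans discussed next aim to reach the same result while examining far fewer positions.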


4 GNFA-BASED QUERY PLAN 


In this section, we introduce a GNFA-based query plan. The GNFA
adopts the same q value as the q-gram inverted index. Hence,
all the grams that appear in the GNFA are matching grams of the
regex. Based on the GNFA, we can then find a good order
for checking positions in all matching lists: we start from the
shortest matching inverted list in the GNFA (i.e., from the gram
with the smallest selectivity) and check every position in that gram
list. We repeat such checking until all the matching lists have been
examined or there is no matching position left to answer the regex.
Since the matching gram with the smallest selectivity is examined
first, most positions in the other matching lists are never
examined, which improves the matching performance significantly.

We first consider the simple case of computing occurrences from
a single path of the GNFA in Sec. 4.1, then we discuss how to extend
the method to multiple paths in Sec. 4.2.


4.1 Regex Occurrences from Single Path 


Recall that in Theorem 3, given a set of positions on transition grams,
the positional constraints on all pairs of transition grams
are checked. In fact, most of the positional constraints required
by Theorem 3 are redundant, as shown in the following.


Removing Redundant Positional Constraints. Since
$C(\pi_i, \pi_k)$ is satisfied whenever both $C(\pi_i, \pi_j)$ and $C(\pi_j, \pi_k)$
are satisfied, the positional constraints
required by Theorem 3 are satisfied iff the positional
constraint is satisfied for every pair of adjacent transition grams in the path
$P$. Based on this property, for a single path $P$ in $A_g$, we only need
to check the positional constraints for adjacent transition grams in $P$.

For the same example, let $\pi_1, \pi_5, \ldots, \pi_{11}$ be the positions of the
transition grams $g_1, g_5, \ldots, g_{11}$. Only the positional constraints
for adjacent transition grams need to be checked, i.e.,
$C(\pi_1, \pi_5)$, $C(\pi_5, \pi_6)$, $C(\pi_6, \pi_{10})$ and $C(\pi_{10}, \pi_{11})$.

To compute the valid position combinations that
satisfy the positional constraints on adjacent transition grams for
a single path $P$, a naive method is to enumerate all position
combinations for the transition grams in $P$ and then check the
positional constraints for them. This method is inefficient
since many unnecessary position combinations have to
be checked. Instead, we find that the technique of merging lists
can be used to compute the valid position combinations from
the gram lists.

The multi-list intersection algorithm max has been shown to be
superior to the other algorithms [20]. Its basic idea is to
examine each position $\pi_b$ (called a starting position) of
the shortest list (called the starting list) individually, searching for $\pi_b$ in
the remaining lists and returning the first position $\pi' \ge \pi_b$. If
the obtained position satisfies $\pi' > \pi_b$ on a list, it is called a failed
position and serves as feedback for the starting list: the
positions $< \pi'$ can be skipped. max avoids reordering lists when examining
different starting positions by employing a fixed list accessing
order (i.e., the order of list size), while still being able to skip the
invalid starting positions on the starting list using failed positions.

We follow max to compute valid position combinations from 
multiple gram lists by (1) using the shortest gram list as the starting 
list (the corresponding transition gram is called starting gram); 
and (2) employing a fixed gram list accessing order according 
to the list size (i.e., the selectivity of the transition gram), but 
a different gram list accessing order is used since we need to 
iteratively check the positions of adjacent transition grams. 

The starting gram $g_b$ may lie in the middle of
$P$. In this case, the positional constraints between adjacent
transition grams are checked iteratively from $g_b$ in both directions.
Furthermore, among the transition grams to be checked in
the two directions, the one with the shorter gram list is
checked first, since there is a larger possibility that this list
does not contain the required gram position (if so, the checking
of positional constraints can be terminated). Essentially, this
method examines whether a matching q-gram can be extended to a full
occurrence of the regex, so we call this method Extend.

Fig. 3. Finding the valid position combinations starting from the shortest
gram list.


Fig. 3 shows an example of computing valid position combinations
by Extend, where $l(g_6)$ is the starting list since it is the
shortest. Since $|l(g_1)| < |l(g_5)| < |l(g_{10})|$, the position of $g_5$ is
checked first, followed by the positions of $g_1$, $g_{10}$ and $g_{11}$ (i.e.,
for each starting position, $C(\pi_5, \pi_6)$, $C(\pi_1, \pi_5)$, $C(\pi_6, \pi_{10})$ and
$C(\pi_{10}, \pi_{11})$ are checked in this order). For instance, for $\pi_6 = 17$,
only $C(\pi_5, 17)$ is checked, and it fails since 16 is
not a position of $g_5$. A valid position combination is found when
examining the starting position $\pi_6 = 37$. Note that we can skip
the starting position $\pi_6 = 73$: a failed position $\pi'_1 = 87$ is
obtained from $l(g_1)$ when examining $\pi_6 = 59$, and since $g_6$ lies two
positions after $g_1$, the next starting position should be no less than $87 + 2 = 89$.
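
A minimal single-path sketch of this procedure is given below, under several assumptions: positions are 0-based, the gram lists are given in path order, Python's `bisect_left` stands in for the list search, and the function names `positions` and `extend_single_path` are hypothetical. It is an illustration of the idea (shortest list first, shorter side first, feedback from failed positions), not the authors' implementation.

```python
from bisect import bisect_left

def positions(doc, gram):
    """Sorted starting positions of gram in doc (a stand-in for an index lookup)."""
    return [i for i in range(len(doc) - len(gram) + 1) if doc.startswith(gram, i)]

def extend_single_path(lists):
    """lists[i]: sorted positions of the i-th transition gram on one path.
    Return the position of the first gram for every valid position combination."""
    k = len(lists)
    b = min(range(k), key=lambda i: len(lists[i]))  # starting gram: shortest list
    results, lower_bound = [], 0
    for start in lists[b]:
        if start < lower_bound:
            continue  # skipped thanks to an earlier failed position
        left, right, ok = b - 1, b + 1, True
        while ok and (left >= 0 or right < k):
            # Extend towards the side whose next gram list is shorter.
            go_left = right >= k or (left >= 0 and len(lists[left]) <= len(lists[right]))
            t = left if go_left else right
            if go_left:
                left -= 1
            else:
                right += 1
            expected = start + (t - b)
            i = bisect_left(lists[t], expected)
            if i == len(lists[t]):
                return results  # this list has no position >= expected any more
            if lists[t][i] != expected:
                ok = False
                lower_bound = lists[t][i] - (t - b)  # feedback from the failed position
        if ok:
            results.append(start - b)
    return results

doc = "suchfn such suchfn chfn"
lists = [positions(doc, g) for g in ["su", "uc", "ch", "hf", "fn"]]
print(extend_single_path(lists))  # [0, 12]
```

In this toy document the starting position 7 fails on the list of hf with failed position 15, which yields the lower bound 12 for the next starting position.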

From the above example, we see that the atomic operation in
Extend is checking a positional constraint when one of its
positions is given. We use $C(\pi_i, \hat{\pi}_j)$ to formalize it, where $\pi_i$ is
the given position and $\hat{\pi}_j$ is called an expected position for $g_j$. Let
$\delta$ be the positional offset between $g_i$ and $g_j$ in a path of the GNFA; $\hat{\pi}_j$
can be computed as $\pi_i + \delta$, and $C(\pi_i, \hat{\pi}_j)$ is then checked as follows.


Checking Positional Constraint through Gram List. Given a
positional constraint $C(\pi_i, \hat{\pi}_j)$, it is checked by searching for $\hat{\pi}_j$ in
$l(g_j)$. Several searching algorithms can be used to search for $\hat{\pi}_j$ in
$l(g_j)$ and return the first position $\pi'_j$ such that $\pi'_j \ge \hat{\pi}_j$ [19], [20],
[21]. If $\pi'_j = \hat{\pi}_j$, then $C(\pi_i, \hat{\pi}_j)$ is satisfied. In this paper, we
employ gallop search to check a positional constraint, which
has time complexity $O(\log |l(g_j)|)$ in the worst case.
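
The check can be sketched with a textbook gallop (exponential) search: double the step until the target is bracketed, then finish with a binary search inside the bracket. The function names `gallop_search` and `check_constraint` are hypothetical, and the list used in the example is the first four entries of the list of g5 as shown in Fig. 2.

```python
from bisect import bisect_left

def gallop_search(positions, target, lo=0):
    """Index of the first element >= target in the sorted list positions[lo:],
    or len(positions) if no such element exists."""
    n = len(positions)
    if lo >= n or positions[lo] >= target:
        return min(lo, n)
    bound = 1
    # Double the step until positions[lo + bound] >= target or the list ends.
    while lo + bound < n and positions[lo + bound] < target:
        bound *= 2
    # positions[lo + bound // 2] < target, so the answer lies right of it.
    return bisect_left(positions, target, lo + bound // 2 + 1, min(lo + bound, n))

def check_constraint(positions, expected):
    """Return (satisfied, found), where found is the first position >= expected (or None)."""
    i = gallop_search(positions, expected)
    found = positions[i] if i < len(positions) else None
    return found == expected, found

l_g5 = [12, 36, 58, 69]
print(check_constraint(l_g5, 16))  # (False, 36): 16 is not a position of g5
print(check_constraint(l_g5, 36))  # (True, 36)
```

When the check fails, the returned position is exactly the failed position that can be fed back to skip starting positions.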


4.2 Regex Occurrences from Multiple Paths 


For a GNFA $A_g$ with multiple paths, the occurrences are the
union of the results computed from each path of $A_g$. A
straightforward idea is to enumerate all paths of $A_g$ and then
use the aforementioned method to process each path. However,
this method has the drawback that positional constraints
shared by different paths are checked repeatedly. Also, for a
regex that contains repeating units, the GNFA is a cyclic graph,
which makes it infeasible to enumerate all paths of $A_g$. Next,
we show how to use Extend to compute the occurrences from
multiple paths of the GNFA (the algorithm is depicted in Alg. 1).


4.2.1 Determining Starting Transition Grams 


Unlike the case of a single path, if $A_g$ contains multiple paths, there
can be multiple starting grams; we use $S(g_b)$ to denote the
starting gram set. In order to examine the fewest starting positions,
we use the min-cut of the GNFA as the starting grams.

Definition 2. (Min-cut) Given a GNFA $A_g$, the min-cut of $A_g$ is
a set of transition grams with the minimal sum of gram
list sizes that partitions the nodes into two disjoint subsets.


For the example in Fig. 1(b), $g_6$ is the min-cut of $A_g$ since
it has the minimal sum of gram list sizes among all cuts of
$A_g$; accordingly, $g_6$ is still the starting gram for $A_g$.

As shown in Fig. 3, if the starting positions are examined
in ascending order, the obtained positions can be used to
skip further invalid starting positions. To this end, when there
are multiple starting grams in $A_g$, we build a min-heap $H$ over
the top positions of the starting lists, so that the starting positions
from different gram lists are examined in ascending order
(line 2 in Alg. 1).


4.2.2 A Query Plan Based on GNFA 


Next, we show how to examine a starting position in the case of
multiple paths. Unlike the case of a single path, a transition gram
can have multiple adjacent transition grams from different paths
in one direction, e.g., $g_6$ has two adjacent transition grams $g_7$ and
$g_{10}$ in the forward direction in Fig. 1(b). The positional constraints
for all these transition grams need to be checked, so that the
occurrences from different paths are computed.

Algorithm 1: Extend

Input : A GNFA $A_g$ and the positional inverted index
Output: The position set $R$ of regex occurrences
 1 $S(g_b) \leftarrow$ min-cut of $A_g$;  // starting grams
 2 Build a min-heap $H$ for the gram lists of the starting grams in $S(g_b)$;
 3 while $H$ is not null do
 4     $\pi_b \leftarrow H$.pop();
 5     Let $g_b$ be the starting transition gram such that $\pi_b \in l(g_b)$;
 6     $S_f \leftarrow \emptyset$, $S_b \leftarrow \emptyset$, flagF $\leftarrow$ false, flagI $\leftarrow$ false;
 7     for each successor (predecessor) gram $g_j$ of $g_b$ do
 8         $\hat{\pi}_j \leftarrow \pi_b + 1$ ($\hat{\pi}_j \leftarrow \pi_b - 1$);  // $\pi_b$ is given
 9         Add $C(\pi_b, \hat{\pi}_j)$ to $S_f$ (add $C(\pi_b, \hat{\pi}_j)$ to $S_b$);
10     while $S_f \neq \emptyset$ or $S_b \neq \emptyset$ do
11         $C(\pi_i, \hat{\pi}_j) \leftarrow$ Select($S_f$, $S_b$);
12         Pop $C(\pi_i, \hat{\pi}_j)$ from the corresponding set;
13         if Check($C(\pi_i, \hat{\pi}_j)$) is true then
14             $\pi_j \leftarrow \hat{\pi}_j$;  // $\hat{\pi}_j$ is found from $l(g_j)$
               // consider adjacent transition grams
15             if $C(\pi_i, \pi_j)$ comes from $S_f$ (or $S_b$) then
16                 if $g_j$'s end node = $A_g.F$ (or $g_j$'s start node = $A_g.Z$) then
17                     flagF $\leftarrow$ true (or flagI $\leftarrow$ true);
18                     continue;
19                 for each successor (or predecessor) gram $g_z$ of $g_j$ do
20                     $\hat{\pi}_z \leftarrow \pi_j + 1$ (or $\hat{\pi}_z \leftarrow \pi_j - 1$);
21                     Add $C(\pi_j, \hat{\pi}_z)$ to $S_f$ (or $S_b$);
22         if ($S_f = \emptyset$ and !flagF) or ($S_b = \emptyset$ and !flagI) then
23             break;  // failed in a direction
24     if flagI and flagF then
25         Add $\pi_b$ to $R$;  // find a regex occurrence
26     else if $lb_\pi$ exists then
27         Skip the starting positions $< lb_\pi$ from $l(g_b)$;  // feedback
28 return $R$;


To do so, two sets $S_f$ and $S_b$ are used in Extend to
record the positional constraints that need to be checked in the
forward and backward directions, respectively. Given a starting
position $\pi_b$, the first step is to initialize $S_f$ and $S_b$ using the
positional constraints between $g_b$ and its adjacent grams in both
directions (lines 7-9). Then, a positional constraint $C(\pi_i, \hat{\pi}_j)$ is
chosen from $S_f$ (or $S_b$) and checked (lines 11-13). If $C(\pi_i, \hat{\pi}_j)$ is
satisfied (i.e., $\hat{\pi}_j \in l(g_j)$), the positional constraints for the adjacent
grams of $g_j$ are added to the corresponding set (lines 15-21).


Likewise, Extend uses two flags, flagF and flagI, to indicate
whether the positional constraints in the forward and backward directions
are satisfied for a starting position. If the leftmost (rightmost)
transition gram in a path is accessed and the corresponding
positional constraint is satisfied, then flagI (flagF) is set to
true (lines 16-17). The positional constraints in the sets $S_f$ and $S_b$
are checked iteratively until the checks in one direction fail (line
22) or all positional constraints required by a path are satisfied
(i.e., flagF and flagI are both true) (line 24).

In Extend, each checked positional constraint is selected
from $S_f$ or $S_b$ by the following strategies (line 11): (i) choosing
the set with the smaller sum of list sizes over the
grams of its positional constraints from $\{S_f, S_b\}$ (i.e.,
$\min\{\sum_{C(\pi_i, \hat{\pi}_j) \in S} |l(g_j)|, S \in \{S_f, S_b\}\}$); (ii) within a set $S$,
choosing the positional constraint which has the larger gram list size
(i.e., $\max\{|l(g_j)|, C(\pi_i, \hat{\pi}_j) \in S\}$). The rationale behind the
strategies is that (i) the positional constraints on the transition
grams with a smaller selectivity are more likely to fail,
so that fewer checks of positional constraints are needed for a
starting position which is not a matching position; (ii) if a starting
position is a matching position, then the positional constraint
whose corresponding gram has a larger selectivity in a direction
is more likely to be satisfied, so the checks for the positional
constraints from other paths can be avoided.

Fig. 4 shows an example of computing the occurrences from
multiple paths by Extend. Table 1 presents the corresponding
checked positional constraint in each step. At first, for each
starting position, $S_f$ and $S_b$ are initialized. Since $|l(g_{10})| + |l(g_7)| <
|l(g_5)| + |l(g_4)|$ and $|l(g_{10})| > |l(g_7)|$, $C(\pi_6, \hat{\pi}_{10})$ is always the
first checked item. Extend needs to check 15 positional constraints
for the four starting positions, and finds that $\pi_6 = 37$ is the matching
position of an occurrence.
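
The two selection strategies can be sketched as a small function. The representation of a constraint as a `(gram, expected_position)` pair, the name `select`, and the list sizes below are assumptions of this example; the sizes are made up and chosen only so that the two inequalities stated in the text hold.

```python
def select(S_f, S_b, list_size):
    """Pick and remove the next positional constraint to check.
    S_f, S_b: lists of (gram, expected_position); list_size: gram -> gram list size."""
    candidates = [S for S in (S_f, S_b) if S]
    # (i) Prefer the set whose grams have the smaller total list size.
    chosen_set = min(candidates, key=lambda S: sum(list_size[g] for g, _ in S))
    # (ii) Inside that set, prefer the constraint whose gram has the larger list.
    constraint = max(chosen_set, key=lambda c: list_size[c[0]])
    chosen_set.remove(constraint)
    return constraint

# Hypothetical list sizes with |l(g10)| + |l(g7)| < |l(g5)| + |l(g4)| and |l(g10)| > |l(g7)|.
sizes = {"g10": 7, "g7": 5, "g5": 6, "g4": 7}
S_f = [("g7", 38), ("g10", 38)]  # forward constraints for the starting position 37
S_b = [("g5", 36), ("g4", 36)]   # backward constraints for the starting position 37
print(select(S_f, S_b, sizes))   # ('g10', 38)
```

With these sizes the forward set is chosen first and, within it, the constraint on g10, which agrees with the order described for the running example.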

Fig. 4. Examining starting positions using Extend.


TABLE 1
Illustration of Extend, where the bold positional constraint is checked in
each iteration and the removed items are exclusive to the checked item.

(Table body: for each starting position, the contents of $S_f$ and $S_b$ and the checked positional constraint in each step.)





Notice that Extend can also be used for a regex containing
repeating units. The GNFA $A_g$ is a cyclic graph for such a regex.
For a path with cycles in $A_g$, Extend iteratively checks the
positional constraints on adjacent transition grams, including the
grams that lie on cycles. Accordingly, the occurrences produced by
a path with cycles can also be computed.
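
The overall multi-path procedure can be sketched in simplified form as follows. This is not Algorithm 1 itself: it omits the Select strategy, exclusive constraints and the skipping feedback, and it models the GNFA as a successor map between 2-gram labels, assuming that each label identifies one transition gram. The successor map `SUCC` is derived by hand from the running regex (with a plain space for the explicit space symbol), and all names are hypothetical.

```python
import heapq
from bisect import bisect_left
from collections import defaultdict

def build_index(doc, q=2):
    """Positional q-gram inverted index: gram -> sorted starting positions."""
    index = defaultdict(list)
    for p in range(len(doc) - q + 1):
        index[doc[p:p + q]].append(p)
    return index

def contains(lst, x):
    i = bisect_left(lst, x)
    return i < len(lst) and lst[i] == x

def reaches(gram, pos, step, edges, terminal, index):
    """True if adjacent grams lead from (gram, pos) to a terminal gram,
    moving one position per transition (step = +1 forward, -1 backward)."""
    frontier, seen = [(gram, pos)], set()
    while frontier:
        g, p = frontier.pop()
        if g in terminal:
            return True
        for nxt in edges.get(g, ()):
            state = (nxt, p + step)
            if state not in seen and contains(index.get(nxt, []), p + step):
                seen.add(state)
                frontier.append(state)
    return False

def extend(succ, first, last, start_grams, index):
    """Return the starting positions (of the starting grams) that lie on an occurrence."""
    pred = defaultdict(list)
    for g, nxts in succ.items():
        for n in nxts:
            pred[n].append(g)
    # Merging the starting lists plays the role of the min-heap over their top positions.
    merged = heapq.merge(*[[(p, g) for p in index.get(g, [])] for g in start_grams])
    return [p for p, g in merged
            if reaches(g, p, +1, succ, last, index)
            and reaches(g, p, -1, pred, first, index)]

# Hand-derived gram adjacency for su( )*ch(ow|f)n with q = 2 (an assumption of this sketch).
SUCC = {"su": ["uc", "u "], "u ": ["  ", " c"], "  ": ["  ", " c"], " c": ["ch"],
        "uc": ["ch"], "ch": ["ho", "hf"], "ho": ["ow"], "ow": ["wn"], "hf": ["fn"]}

doc = "su  chown; suchfn; such; su chfx; chfn"
print(extend(SUCC, {"su"}, {"wn", "fn"}, ["ch"], build_index(doc)))  # [4, 13]
```

The self-loop on the two-space gram shows why a cyclic GNFA poses no problem here: every transition advances the position by one, so the search always terminates.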

Let $l_{max}$ be the maximal length of the matching strings in
$L(Q)$ and $|L(Q)|$ be the bound on the number of paths in
$A_g$. To examine a starting position, the number of checked
positional constraints is bounded by $l_{max}|L(Q)|$. Let $N_l$ be
the maximal gram list size. The algorithm takes time $O(|L(Q)|)$
to choose a positional constraint and time $O(\log N_l)$ to check
it. Moreover, the number of starting positions is bounded by
$|L(Q)| N_l$. Accordingly, the time complexity of Extend is
$O(|L(Q)|^2 N_l \, l_{max} (\log N_l + |L(Q)|))$.

4.2.3 Skipping Unnecessary Positional Constraints


In the following, we give two optimization techniques to further 
reduce the checks for positional constraints. 


(1) Exclusive positional constraints on different paths.

Unlike the case of a single path, with multiple paths the
positional constraints from different paths
can be exclusive.

Definition 3. (Exclusive positional constraints) Let $g_i$, $g_j$ and
$g_z$ be transition grams in paths of $A_g$ such that $g_i$ and $g_j$
have the same positional offset with respect to $g_z$. For a given position $\pi_z$ of
$g_z$, if $g_i \neq g_j$, then the positional constraints $C(\pi_z, \hat{\pi}_i)$ and
$C(\pi_z, \hat{\pi}_j)$ w.r.t. $g_i$ and $g_j$ are exclusive.


Consider the example in Table 1. For a given position $\pi_6$ of
$g_6$, since $g_7$ and $g_{10}$ cannot both appear at the same position $\pi_6 + 1$,
$C(\pi_6, \hat{\pi}_7)$ and $C(\pi_6, \hat{\pi}_{10})$ in $S_f$ are exclusive.

Therefore, if a positional constraint $C(\pi_i, \hat{\pi}_j)$ is satisfied, its
exclusive positional constraints must fail and can be skipped.
For the example in Table 1, when examining $\pi_6 = 17$ and $\pi_6 =
37$, $C(\pi_6, \hat{\pi}_{10})$ is checked in the first step and satisfied, so its
exclusive positional constraint $C(\pi_6, \hat{\pi}_7)$ does not need to be
checked for these two starting positions.


(2) Skipping invalid starting positions. 

Similar to the case of a single path, the failed positions can
also be used to prune invalid starting positions in the case of multiple
paths. Consider the transition grams $g_b$ and $g_j$ in a path
of $A_g$, let $\pi_b$ be a starting position of $g_b$ and $C(\pi_b, \hat{\pi}_j)$ be the
positional constraint between $g_b$ and $g_j$ for $\pi_b$. As illustrated
in Fig. 5, if a failed position $\pi'_j > \hat{\pi}_j$ of $g_j$ is obtained when
checking $C(\pi_b, \hat{\pi}_j)$ (i.e., $\pi_b$ is a non-matching position), then for
the positional constraint with $\pi'_j$ to be satisfied, the next starting position
should be no less than $\hat{\pi}_b^{j} = \pi'_j - (\hat{\pi}_j - \pi_b)$. We call $\hat{\pi}_b^{j}$ an expected
starting position for the path containing $g_b$ and $g_j$.


(Figure labels: positional offset; skippable starting positions.)

Fig. 5. Skipping invalid starting positions by failed positions.


For multiple paths sharing the same starting gram g_b, suppose
a starting position π_b of g_b is a non-matching position and all
paths of A_g produce expected starting positions. The minimum
of these expected starting positions is then a lower bound
lb_π for the next starting position. Let F(g) be the set of transition
grams producing failed positions from different paths. If F(g) is
a cut of A_g (i.e., all paths produce expected starting positions),
the lower bound lb_π is computed as follows, where π'_j is the failed
position obtained for g_j and π_j − π_b is the positional offset of g_j
from g_b.

$$
lb_\pi =
\begin{cases}
\min_{g_j \in F(g)} \bigl( \pi'_j - (\pi_j - \pi_b) \bigr) & \text{if } F(g) \text{ is a cut of } A_g, \\
\text{null} & \text{otherwise.}
\end{cases}
\tag{1}
$$

For the running example, for π_6 = 59, failed positions 78
and 93 are obtained by g_10 and g_9 while checking C(π_6, π_10)
and C(π_6, π_9). Since g_10 and g_9 form a cut of A_g, we get the
expected starting positions 77 and 90, so lb_π = 77. Hence, the
starting position π_6 = 73 can be skipped since it is less than lb_π.
Notice that the GNFA-based method improves matching
efficiency by reusing the intermediate results (failed positions)
obtained while checking starting positions that turn out
to be non-matching. This is one of the reasons that
our proposed method outperforms the existing methods.
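The lower-bound computation can be illustrated with a minimal Python sketch. The function names and the dictionary layout are assumptions made for this example; the offsets 1 and 3 are inferred from the numbers in the running example (78 gives 77, 93 gives 90), and the candidate list other than 59 and 73 is invented.

```python
def expected_start(failed_pos, offset):
    """Expected starting position implied by one failed position."""
    return failed_pos - offset


def lower_bound(failed, is_cut):
    """failed maps a gram to (failed position, offset from the starting gram).

    Returns None (the "null" case) unless the failing grams form a cut.
    """
    if not is_cut or not failed:
        return None
    return min(expected_start(pos, off) for pos, off in failed.values())


# Running example for the starting position 59 (offsets are assumptions).
failed = {"g10": (78, 1), "g9": (93, 3)}
lb = lower_bound(failed, is_cut=True)

# Hypothetical list of later starting positions; those below lb are skipped.
later_starts = [73, 80, 95]
kept = [s for s in later_starts if lb is None or s >= lb]
```

Deciding whether the failing grams actually form a cut of the GNFA is left out here and passed in as a flag.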




5 OPTIMIZING QUERY PLAN USING GRAM SELECTIVITY


Although the algorithm Extend computes the occurrences of a 
regex using the GNFA, it suffers from the following drawbacks. (i) 
In Extend, we have to use two additional sets (Sp and Sp) to find 
all occurrences from different paths of the GNFA; maintaining them
incurs extra cost since they are updated after each
positional constraint check (see the example in Table 1). (ii)
The checking order, which iteratively checks
the positional constraints of adjacent transition grams, is not fully
optimized, especially when the transition grams of the GNFA differ
greatly in selectivity.

In this section, we show that the checks for positional constraints
can be represented by a binary plan tree. Using a binary plan tree,
the cost of maintaining the additional sets is avoided and the
checking order can be further improved by building a good plan
tree. We first discuss the tree-based query plan for the simple
case in which the regex does not contain any repeating unit (Sec. 5.1
and 5.2), then extend it to regexes with repeating units (Sec. 5.3).


5.1 Tree-based Query Plan


The idea behind the tree-based query plan is that, before executing
a query plan, we compute all positional constraints from different
paths and organize them into a tree-based query plan (called a binary
plan tree). We then examine the starting positions using this plan
tree and the q-gram positions from the gram lists.


5.1.1 Positional Constraints with Starting Grams 


To check positional constraints in different orders, we
next show another way of checking the positional constraints required
by Theorem 3. Based on the property that C(π_i, π_k) is satisfied
if both C(π_i, π_j) and C(π_j, π_k) are satisfied, we get that if the
positional constraint C(π_b, π_i) between the starting gram g_b and
any transition gram g_i (g_i ≠ g_b) in a path P is satisfied for a set
of gram positions, then there is an occurrence (Theorem 3 holds).
For ease of presentation, we consider another regex query
Q = su(.)?ch(ow|f)n without any repeating unit; its GNFA is
shown in Fig. 6. Consider the path containing g_1, g_5, g_6, g_10 and
g_11. According to the above property, if C(π_6, π_1), C(π_6, π_5),
C(π_6, π_10) and C(π_6, π_11) are satisfied for a starting position
π_6, then an occurrence is found.
[Figure omitted]


Fig. 6. The GNFA for the regex without any repeating unit.


In this way, once a starting gram g_b is selected, all the positional
constraints from multiple paths can be computed before finding
occurrences from the gram lists, which gives the opportunity to check
the positional constraints in different orders. All positional
constraints from different paths are computed by performing a
bidirectional traversal of A_g from g_b. We use S_C to denote the
set of positional constraints in A_g. Fig. 7 shows the
positional constraints computed from the GNFA in Fig. 6.

Note that a positional constraint C(π_b, π_i) can be shared by
multiple paths if the transition gram g_i reaches g_b through
these paths with the same positional offset. But for a gram
g_i that has different positional offsets with g_b in different paths,
different positional constraints are created.
Fig. 7. Positional constraints for the GNFA in Fig. 6.
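To make the construction of S_C concrete, the following Python sketch enumerates positional constraints relative to a starting gram. It is a simplification under stated assumptions: instead of traversing a GNFA, it takes the four strings of L(Q) for Q = su(.)?ch(ow|f)n, treats the wildcard "." as a literal label, uses q = 2, and picks "ch" as the starting gram. The names `qgrams` and `constraints_from_paths` are hypothetical.

```python
def qgrams(s, q=2):
    """All q-grams of s in order; adjacent q-grams are one position apart."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def constraints_from_paths(paths, start_gram):
    """Map each constraint (gram, offset) to the ids of the paths sharing it.

    The offset is the position of the gram minus the position of the
    starting gram on that path. Assumes the starting gram occurs at most
    once per path.
    """
    shared = {}
    for pid, path in enumerate(paths):
        if start_gram not in path:
            continue
        base = path.index(start_gram)
        for pos, gram in enumerate(path):
            if pos != base:
                shared.setdefault((gram, pos - base), set()).add(pid)
    return shared


# One path per matching string of the example query.
paths = [qgrams(s) for s in ("suchown", "suchfn", "su.chown", "su.chfn")]
s_c = constraints_from_paths(paths, "ch")
```

In this toy setting the gram "su" yields two constraints, one with offset -2 and one with offset -3, while ("ho", 1) is a single constraint shared by two paths, which mirrors the remark above about shared and distinct constraints.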


5.1.2 Plan Tree 


Actually, the priority of checking positional constraints from 
different paths depends on the selectivity of the corresponding 
transition grams. Also, the positional constraints follow a certain 
checking order for different starting positions. We find this order 
can be computed before examining starting positions and be 
naturally represented by a binary tree structure. 


Definition 4. (Binary plan tree) A binary plan tree T is a
rooted binary tree in which each internal node has two
children and represents a positional constraint, and each terminal
node represents the matching result for a starting position.


There are two types of terminal nodes, called T-terminal
and F-terminal, which represent a successful match (i.e., an
occurrence is found) and a failed match for a starting position, respectively.
For each internal node C(π_b, π_i), if it is satisfied, its left
child is checked; otherwise, its right child is checked.

There can exist multiple plan trees that represent different
checking orders. Fig. 8 shows two different
plan trees for the running example. The plan tree in Fig. 8(a)
has the same checking order as Extend. Fig. 8(b) is another
plan tree with a different checking order; its leftmost T-terminal
means the examined starting position is a matching position
of the string su.chfn in L(Q), since C(76, 72), C (T6, 7), 
C (T6, 74), C(76, 711) and C(7¢, 710) are satisfied and they exist 
in the path representing the matching string su.chfn. 
(a) A plan tree that has the same checking order as Extend.


(b) A plan tree with an improved checking order for positional constraints.


Fig. 8. Two different plan trees for the GNFA in Fig. 6.


When S(g_b) contains multiple starting grams, for a starting
position π_b from g_b ∈ S(g_b), only the paths that contain g_b can
produce an occurrence at π_b. Hence, instead of building one plan tree for
all positional constraints, for each starting gram g_b ∈ S(g_b) we
build a plan tree T for g_b using the positional constraints whose
transition grams lie on the same paths as g_b.


5.1.3. Matching Regex by Binary Plan Tree 


Next, we introduce how to use the plan tree to examine the starting
positions. The algorithm is described in Alg. 2. Initially, for each
starting gram g_b ∈ S(g_b), a plan tree T is built for the positional
constraints whose transition grams share the same
paths with g_b in A_g. As in Extend, a min-heap H is built for
the gram lists of the starting grams, so that the starting positions
are checked in ascending order.


Algorithm 2: TreeMatch

Input : The starting gram set S(g_b) and positional inverted index;
Output : The position set R of regex occurrences;
1  for each starting gram g_b ∈ S(g_b) do
2      Build a plan tree T for the positional constraints whose
       transition grams exist in the same path with g_b;
3  Build a min-heap H for the gram lists of transition grams in S(g_b);
4  while H is not empty do
5      π_b ← H.pop();
6      Let g_b be the starting gram such that π_b ∈ I(g_b), and T be
       the corresponding plan tree of g_b;
7      Let C(π_b, π_i) be the root of T;
8      while C(π_b, π_i) is not a terminal node do
9          if Check(C(π_b, π_i)) is true then
10             C(π_b, π_i) ← the left child of C(π_b, π_i);
11         else
12             C(π_b, π_i) ← the right child of C(π_b, π_i);
13     if C(π_b, π_i) is a T-terminal node then
14         Add π_b to R;
15     else if lb_π exists then
16         Skip the starting positions < lb_π from I(g_b);
17 return R;

As mentioned above, using additional sets to maintain
positional constraints incurs extra time cost. The plan
tree avoids this problem by checking positional constraints in
a deterministic fashion. Given the plan tree T and a starting
position, the positional constraints are checked from the root of T
until a terminal node is reached (lines 7-12). For each positional
constraint C(π_b, π_i), only one child of C(π_b, π_i) is checked:
if C(π_b, π_i) is true, its left child is checked; otherwise,
its right child is checked.

Another advantage of the plan tree is that unnecessary
checks caused by exclusive positional constraints can be easily
avoided by putting them into different subtrees. For the example
in Fig. 8(b), according to the definition of exclusive positional
constraints, C(π_6, π_2) and C(π_6, π_1) in Fig. 7 are exclusive.
Since C(π_6, π_1) exists in the right subtree of C(π_6, π_2), the
algorithm avoids checking C(π_6, π_1) when C(π_6, π_2) is satisfied.

As in Extend, the obtained failed positions can be used to
skip invalid starting positions if an F-terminal is reached for a
starting position π_b (lines 15-16). Here, F(g), used to compute lb_π
(see Sec. 4.2.3), is the set of transition grams whose corresponding
positional constraints fail on the root-to-F-terminal path.

In TreeMatch, the number of checked positional constraints
for a starting position equals the number of internal nodes on
the root-to-terminal path of T. Consider the example in Fig. 8(b):
5 positional constraints are checked for the starting position 37,
since there are 5 internal nodes on that root-to-T-terminal
path. TreeMatch needs to check 11 positional constraints for all
starting positions, which is fewer than the 15 checks used by Extend.
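The traversal described above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the plan tree is given by hand rather than built from a GNFA, the lower-bound skipping of lines 15-16 is left out, and all names (`Node`, `build_index`, `check`, `tree_match`) as well as the toy query are assumptions. `heapq.merge` plays the role of the min-heap over the starting gram lists, and `bisect_left` gives the logarithmic-time check.

```python
from bisect import bisect_left
import heapq


class Node:
    """Internal plan-tree node: one positional constraint (gram, offset)."""

    def __init__(self, gram, offset, left, right):
        self.gram, self.offset = gram, offset
        self.left, self.right = left, right  # subtree if satisfied / if failed


T_TERMINAL, F_TERMINAL = "T", "F"


def build_index(text, q):
    """Positional q-gram inverted index: gram -> ascending start positions."""
    index = {}
    for i in range(len(text) - q + 1):
        index.setdefault(text[i:i + q], []).append(i)
    return index


def check(index, gram, pos):
    """Binary search for pos in the gram list of gram."""
    lst = index.get(gram, [])
    i = bisect_left(lst, pos)
    return i < len(lst) and lst[i] == pos


def tree_match(index, plan_trees):
    """plan_trees maps each starting gram to the root of its plan tree."""
    results = []
    lists = [[(p, g) for p in index.get(g, [])] for g in plan_trees]
    for pos, gram in heapq.merge(*lists):  # ascending starting positions
        node = plan_trees[gram]
        while isinstance(node, Node):  # walk until a terminal is reached
            ok = check(index, node.gram, pos + node.offset)
            node = node.left if ok else node.right
        if node == T_TERMINAL:
            results.append(pos)
    return results


# Toy query: "ab" followed by "bc" or "bd" (q = 2), starting gram "ab".
index = build_index("abcabdabe", 2)
tree = Node("bc", 1, T_TERMINAL, Node("bd", 1, T_TERMINAL, F_TERMINAL))
matches = tree_match(index, {"ab": tree})
```

Because "bc" and "bd" are exclusive at offset 1, the second check sits in the right (failed) subtree of the first, so it is never evaluated when the first one succeeds.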

Now, we consider the time complexity of TreeMatch. The
algorithm checks at most l_max + |L(Q)| positional constraints
(i.e., the bound on the number of internal nodes in a root-to-terminal
path) to examine a starting position. Also, the cost of checking
a positional constraint is only O(log N_z), and at most |L(Q)| N_z
starting positions are examined. Hence, TreeMatch has the time
complexity O(|L(Q)| N_z (l_max + |L(Q)|) log N_z).





5.2 Computing A Good Plan Tree 


As introduced previously, different checking orders of the positional
constraints lead to different plan trees. Next, we give a
method of computing a plan tree by choosing a good checking order
for positional constraints. To ease the presentation, we discuss the
fundamental case in which a starting gram is shared by all paths of A_g.

As shown in Sec. 5.1.1, we can compute the set of positional
constraints S_C once the starting transition gram is determined. A
plan tree is then recursively computed from S_C, with
two steps in each recursion: (1) choosing a positional constraint as
the internal node and (2) computing the corresponding positional
constraints for the left and right subtrees.


5.2.1 Positional Constraints for Subtrees 


We first consider the second step, which is also the key to
examining a starting position in a deterministic fashion with a
plan tree. Let ex(C(π_b, π_i)) be the set of exclusive positional
constraints for a chosen internal node C(π_b, π_i). The lemma below
gives the properties used to compute the left and right subtrees.


Lemma 1. For a given starting position π_b and a positional
constraint C(π_b, π_i) between a transition gram g_i and g_b:


• If C(π_b, π_i) is true, there is no occurrence at π_b for
the matching strings represented by the paths that contain
the transition grams whose positional constraints exist in
ex(C(π_b, π_i)).

• If C(π_b, π_i) is false, there is no occurrence at π_b for the
matching strings represented by the paths that contain g_i.


For a positional constraint C(π_b, π_i), the positional constraints
that need to be checked when C(π_b, π_i) is satisfied (i.e., those in
the left subtree) are called the compatible positional constraint set,
denoted by S^+(C(π_b, π_i)). Likewise, we define the incompatible
positional constraint set S^-(C(π_b, π_i)), which contains the positional
constraints checked in the right subtree of C(π_b, π_i).

Lemma 1 can be used to compute S^+(C(π_b, π_i)) and
S^-(C(π_b, π_i)). Let P_all be the path collection on A_g, and let
P_{C(π_b, π_i)} and P_{ex(C(π_b, π_i))} be the path collections that contain
the transition gram for C(π_b, π_i) and the transition grams for the
positional constraints in ex(C(π_b, π_i)), respectively. According to Lemma 1,
(1) if C(π_b, π_i) is satisfied, only the matching strings represented
by the paths except P_{ex(C(π_b, π_i))} can produce an occurrence, so
the positional constraints for the transition grams in such paths are
S^+(C(π_b, π_i)); (2) if C(π_b, π_i) fails, only the paths except
P_{C(π_b, π_i)} can produce an occurrence. Writing S(P) for the positional
constraints of the transition grams in a path collection P, we have:

$$
\begin{aligned}
S^{+}(C(\pi_b, \pi_i)) &= S_C \cap \{\, S(P_{all} - P_{ex(C(\pi_b, \pi_i))}) - C(\pi_b, \pi_i) \,\} \\
S^{-}(C(\pi_b, \pi_i)) &= S_C \cap S(P_{all} - P_{C(\pi_b, \pi_i)})
\end{aligned}
\tag{2}
$$

Fig. 9(a) shows an example of computing the positional
constraints for the left and right subtrees when C(π_6, π_2) is chosen as
the root. We can then recursively compute
the subtrees in the same way from the compatible and incompatible constraint
sets of C(π_6, π_2). As Fig. 9(b) shows, to build the left subtree,
an internal node C(π_6, π_7) is chosen from S^+(C(π_6, π_2)), and
the subtrees of C(π_6, π_7) are then computed recursively.
Fig. 9. Computing compatible and incompatible positional constraints for
the chosen internal nodes.
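The compatible and incompatible sets of Eq. (2) can be computed with plain set operations. The Python sketch below is illustrative only: each path is modelled as a frozenset of hypothetical (gram, offset) constraints derived from the four strings of the su(.)?ch(ow|f)n example with q = 2 and starting gram "ch", and the helper names are assumptions.

```python
def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def path_constraints(strings, start, q=2):
    """One frozenset of (gram, offset) constraints per matching string."""
    paths = []
    for s in strings:
        grams = qgrams(s, q)
        base = grams.index(start)
        paths.append(frozenset((g, i - base)
                               for i, g in enumerate(grams) if i != base))
    return paths


def split_constraints(s_c, paths, chosen):
    """Return (S+, S-) for the chosen constraint, following Eq. (2)."""
    gram, offset = chosen
    # ex(C): different gram required at the same offset.
    exclusive = {c for c in s_c if c[1] == offset and c[0] != gram}
    # P_all - P_ex(C): paths that survive when C is satisfied.
    keep_if_true = [p for p in paths if not (p & exclusive)]
    # P_all - P_C: paths that survive when C fails.
    keep_if_false = [p for p in paths if chosen not in p]
    s_plus = (s_c & set().union(*keep_if_true)) - {chosen}
    s_minus = s_c & set().union(*keep_if_false)
    return s_plus, s_minus


paths = path_constraints(("suchown", "suchfn", "su.chown", "su.chfn"), "ch")
s_c = set().union(*paths)
s_plus, s_minus = split_constraints(s_c, paths, ("hf", 1))
```

Choosing ("hf", 1) removes the exclusive constraint ("ho", 1) and everything that only occurs on "ho" paths from the left side, and removes ("hf", 1) and ("fn", 2) from the right side.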


5.2.2 Recursively Computing Plan Tree 


We next consider the issue of choosing a positional constraint as
the internal node for each subtree. For the starting positions that
are not matching positions of the regex (i.e., non-matching
positions), intuitively, a plan tree T checks fewer positional
constraints if they fail in T (i.e., reach
F-terminals) as early as possible. Based on this intuition, we use
the following strategy to select a positional constraint whose
corresponding gram has a lower selectivity.


Algorithm 3: BuildTree

Input : The positional constraint set S_C, the set of satisfied
        positional constraints S_G;
Output : A plan tree T for S_C;
1  if S_C = ∅ then
2      if S_G has the positional constraints required by a path of
       A_g then
3          T ← CreateTerminal(T-terminal);
4      else
5          T ← CreateTerminal(F-terminal);
6  else
7      Choose a positional constraint C(π_b, π_i) from S_C;
8      Compute S^+(C(π_b, π_i)) and S^-(C(π_b, π_i)) for
       C(π_b, π_i);
9      T_l ← BuildTree(S^+(C(π_b, π_i)), S_G ∪ {C(π_b, π_i)});
10     T_r ← BuildTree(S^-(C(π_b, π_i)), S_G);
11     T ← CreateTree(C(π_b, π_i), T_l, T_r);
12 return T;


A positional constraint C(π_b, π_i) can be shared by
different paths of A_g. If C(π_b, π_i) fails, then no occurrence is
produced through these paths for the starting position π_b; hence, a
higher priority should be assigned to such a positional constraint.
Let c be the number of paths that share C(π_b, π_i); we define the
weight of C(π_b, π_i) as |g_i| / c, where |g_i| is the size of the gram
list of g_i. The positional constraint with a
smaller weight is preferentially chosen as an internal node.

We design a recursive algorithm BuildTree to compute a plan
tree from the set of positional constraints, as shown in Alg. 3. As
an initialization, we compute the set of positional constraints S_C
from A_g. In each recursion, a positional constraint is chosen from
S_C as an internal node, then the compatible set and incompatible
set of the chosen positional constraint are computed and used for
building the subtrees (lines 7-11). An additional set S_G is used to
record the satisfied positional constraints for each root-to-terminal
path. A terminal node is created once S_C is empty for a subtree
(lines 1-5). At that time, if S_G contains the positional constraints
required by a path of A_g, then a T-terminal is created; otherwise,
the algorithm creates an F-terminal.
For example, consider the GNFA A_g in Fig. 6 and the gram lists
in Fig. B] All positional constraints are used as the initial set Sc. 
Since C(π_6, π_2) has the minimal weight (|g_2| is the smallest), it
is chosen as the root. Next, the subtrees are recursively computed,
and finally the plan tree in Fig. 8(b) is built.

Given a GNFA A_g, let N_T be the number of transition grams
in A_g; the number of positional constraints for A_g is bounded
by |L(Q)| N_T. Thus, the number of recursions in BuildTree
is at most |L(Q)| N_T. In each recursion, S^+(C(π_b, π_i)) and
S^-(C(π_b, π_i)) can also be computed in time O(|L(Q)| N_T).
Hence, BuildTree has the time complexity O((|L(Q)| N_T)^2).
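The recursion of Alg. 3 can be sketched in Python as below. Several points are assumptions made for this sketch rather than details confirmed by the text: constraints are (gram, offset) pairs, each path is a frozenset of its constraints, the weight is taken to be the gram list size divided by the number of surviving paths sharing the constraint, the surviving paths are tracked explicitly in the recursion, and the list sizes are invented numbers.

```python
def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def path_constraints(strings, start, q=2):
    """One frozenset of (gram, offset) constraints per matching string."""
    paths = []
    for s in strings:
        grams = qgrams(s, q)
        base = grams.index(start)
        paths.append(frozenset((g, i - base)
                               for i, g in enumerate(grams) if i != base))
    return paths


def build_tree(s_c, s_g, paths, list_size):
    """Return 'T', 'F', or (constraint, left subtree, right subtree)."""
    if not s_c:  # terminal: is some surviving path fully satisfied?
        return "T" if any(p <= s_g for p in paths) else "F"

    def weight(c):  # assumed weight: list size / number of sharing paths
        return list_size.get(c[0], 0) / sum(1 for p in paths if c in p)

    chosen = min(sorted(s_c), key=weight)
    gram, offset = chosen
    exclusive = {c for c in s_c if c[1] == offset and c[0] != gram}
    alive_true = [p for p in paths if not (p & exclusive)]
    alive_false = [p for p in paths if chosen not in p]
    s_plus = (s_c & set().union(*alive_true)) - {chosen}
    s_minus = s_c & set().union(*alive_false)
    left = build_tree(s_plus, s_g | {chosen}, alive_true, list_size)
    right = build_tree(s_minus, s_g, alive_false, list_size)
    return (chosen, left, right)


def evaluate(tree, positions, start_pos):
    """Walk the tree; positions maps a gram to the set of its positions."""
    while tree not in ("T", "F"):
        (gram, offset), left, right = tree
        hit = start_pos + offset in positions.get(gram, ())
        tree = left if hit else right
    return tree == "T"


def positions_of(text, q=2):
    pos = {}
    for i, g in enumerate(qgrams(text, q)):
        pos.setdefault(g, set()).add(i)
    return pos


paths = path_constraints(("suchown", "suchfn", "su.chown", "su.chfn"), "ch")
# Invented gram list sizes, used only to drive the weight function.
sizes = {"su": 40, "uc": 25, "u.": 5, ".c": 30, "ho": 20,
         "ow": 35, "wn": 15, "hf": 10, "fn": 12}
tree = build_tree(set().union(*paths), frozenset(), paths, sizes)
```

With these invented sizes the constraint on "u." gets the smallest weight and becomes the root; a different size table would give a different, equally valid tree.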


5.3 Extending Tree-based Query Plan for the Regex 
with Repeating Units 


For a regex without any repeating unit, the GNFA is acyclic, so
the paths of the GNFA are finite and all positional constraints can be
computed from the paths. However, for a regex containing a
repeating unit, the GNFA A_g contains a cycle that can occur
many times in a path, so a transition gram can have
indeterminate positional offsets with the starting gram in the path
containing this cycle, e.g., the positional offset between gz and ge 
can be any value that no less than 2 for the GNFA in Fig.|1(b) 

We utilize a hybrid method to solve this problem. The basic
idea is to convert the cyclic GNFA A_g to an acyclic GNFA A'_g
by creating an auxiliary transition to replace the transition grams
that are affected by cycles. A plan tree T is then built from the
rewritten GNFA A'_g, and the positional constraint for the auxiliary
transition in T is checked by invoking Extend (i.e., checking the
positional constraints for the adjacent transition grams in A_g).

To obtain an acyclic GNFA A'_g, we traverse A_g
starting from g_b in the forward and backward directions, respectively.
When the transition grams in a cycle are accessed, the following
transition grams are replaced by an auxiliary transition. In the
forward direction, the created auxiliary transition ends at the final
node of A_g, whereas in the backward direction it starts at the initial
node of A_g. If there are multiple cycles in A_g, multiple auxiliary
transitions are created in A'_g.





[Figure omitted; label: auxiliary transition]


Fig. 10. The rewritten acyclic GNFA.


Consider the example in Fig. 10. We create the auxiliary
transitions by bidirectionally traversing A_g starting from g_6. In the
backward direction, when g_3 (which exists in a cycle) is accessed, an
auxiliary transition is created to connect the initial node V_0 and
the start node of g_3, so the transition grams affected by the cycle
are replaced by the auxiliary transition.

For the rewritten acyclic GNFA A'_g, we can build a plan tree
from it by creating an auxiliary node C_a for the auxiliary
transition. In TreeMatch, the auxiliary node is checked by
invoking Extend to iteratively check the expected positions of the
adjacent transition grams represented by the auxiliary transition.
If the initial (or final) node is reached, then C_a is satisfied. Since
checking an auxiliary node costs more, the sum of
the gram list sizes of the transition grams represented by the auxiliary
transition is used as the weight of C_a, so that C_a is more
likely to be checked after the other positional constraints when
building a plan tree. Fig. 11 shows the plan tree built from the
rewritten GNFA A'_g above. In A'_g, since C(π_6, π_9) is shared by
three paths, it initially has the smallest weight and is therefore chosen
as the root by BuildTree.
Fig. 11. The plan tree for the GNFA in Fig. 1(b).


6 CONSTRUCTING GRAM-DRIVEN NFA 


In this section, we introduce how to construct a GNFA A_g from
a character-driven NFA A_c. It is nontrivial to convert A_c to A_g,
since (1) there exist disjunctive units or repeating units in A_c; (2)
common q-grams are shared by different paths in A_c, and these
q-grams need to be distinctly represented in A_g.

We propose the algorithm BuildGNFA to compute a GNFA.
The basic idea is to traverse A_c in a depth-first fashion and
meanwhile build A_g from the q-grams computed from different
paths of A_c. BuildGNFA solves the above problems by associating
a subpath of A_c with each created node V in A_g (except the
initial and final nodes), which represents the (q-1)-length common
substring shared by V's two adjacent transition grams. As shown
in Fig. Vı is associated with a subpath v1-v2 in Fig. 
which represents the substring u shared by g and gs (gı and g2). 

In this way, BuildGNFA obtains two advantages. Firstly, it is
easy to compute the transition grams starting at V_i in A_g, even if
V_i has multiple successor nodes. Since a subpath P_i is associated
with V_i, to get the transition grams starting from V_i we
traverse A_c from the last node of P_i until the next characters are
found from the transitions of A_c (lines 11-27). Fig. 12 shows
an example of constructing A_g from A_c. When
V_1 is created and associated with the subpath v1-v2, to
compute the successors of V_1 we traverse A_c from v2 and obtain two
next characters, . and c, in different paths. Two
successor nodes V_2 and V_3 are then created with associated paths v3-v4
and v5-v6, and the transitions are assigned the q-grams u. and uc.

Secondly, the common q-grams shared by different subpaths
in A_c can be distinctly represented in A_g. We observe that the
common q-grams are distinctly represented in A_g if the nodes in
A_g have distinct associated subpaths. Hence, we build a GNFA
by avoiding the creation of duplicate nodes of A_g with the same associated
subpaths. In BuildGNFA, before creating a transition gram in A_g,
a subpath P representing a q-gram is obtained (lines 14-15).
Fig. 12. Associated paths on the nodes of A_g.


At that time, we first check whether the subpath P' of P representing
the (q-1)-length suffix is already associated with an existing node V.
If so, V is directly used as the successor node (lines 24-26). For
example, when computing the successor nodes of V_2 in Fig. 12,
we traverse A_c from v4 and get the characters . and c with subpaths
v3-v4 and v5-v6, respectively. Since V_2 and V_3 with subpaths v3-v4
and v5-v6 in A_g were already created, V_2 and V_3 are directly set
as the successor nodes of V_2, and the transitions with q-grams
.. and .c are created from V_2 to V_2 and V_3, respectively.

For a node v_i in A_c, BuildGNFA accesses v_i if there is a
subpath P representing a q-gram that ends at v_i. Let c_{v_i} be the number
of such paths ending at v_i, so v_i is accessed c_{v_i} times. Let N_m be the
maximum of c_{v_i} among all nodes of A_c and N_c be the number
of nodes of A_c. BuildGNFA then has the time complexity O(N_c N_m).

Recall that l_min is the minimal matching length of a regex Q.
For the case that a regex cannot be represented by a GNFA with
q-grams (i.e., l_min < q), we can compute a GNFA A_g whose
transitions are l_min-length substrings. For each substring s in A_g,
we can easily get all the related q-grams that have s as a prefix; the
list of s can then be computed by merging all related q-gram lists.
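The list merging for a short substring can be sketched as follows, assuming a hypothetical in-memory index that maps each q-gram to its ascending position list. The helper names are illustrative, and `heapq.merge` from the Python standard library is used for the k-way merge.

```python
import heapq


def build_index(text, q):
    """Positional q-gram inverted index: gram -> ascending start positions."""
    index = {}
    for i in range(len(text) - q + 1):
        index.setdefault(text[i:i + q], []).append(i)
    return index


def short_string_list(index, s):
    """Merge the lists of all indexed q-grams that have s as a prefix."""
    lists = [positions for gram, positions in index.items()
             if gram.startswith(s)]
    # Each position starts exactly one q-gram, so the merge has no duplicates.
    return list(heapq.merge(*lists))


index = build_index("abcabdab", 3)
ab_positions = short_string_list(index, "ab")
```

One caveat of this simple sketch: an occurrence of s within the last q-1 characters of a document starts no complete q-gram, so it is not reported here (the final "ab" in the example); handling that boundary case is outside what the sketch covers.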


7 EXPERIMENTS 


In this section, we present experimental results. We conducted ex- 
periments on three real document collections with extracted regex 
queries and genomic sequences with real regex query workloads. 


7.1 Experimental Setup


We used five real datasets in the experiment, including three
document collections and two genomic sequences.


• Web pages: We used 425,104 web pages, which include PHP,
Asp.net, JS and HTML files. The file length varies from 11 to
12,275,654. The total size of the dataset is 2.09GB.


• Source codes: We extracted 697,485 public source code files
from Github, including Python, Java, C and C++ files. The
length of a code file varies from 13 to 662,118 and the average
length is 4,070. The total size of the source codes is 3.69GB.


• Wiki sentences: We extracted 32,882,358 sentences from
Wikipedia and saved them into 328,823 files. The file length varies
from 100 to 150,120 and the dataset size is 10.43GB.


• DNA sequences: The Human Genome HG18 is used, which
has a size of 2.92GB.


• Protein sequences: We extracted protein sequences (198MB)
from Swiss-Prot [24]; it contains 25 distinct characters.

For the three document collections, we collected real
regex queries from the regex library and forum¹ and from extracted
open source codes. The extracted regexes contain frequently used
operations, such as *, +, |, ?, [abc], {n}, {n, m}², and the
wildcard (.), as well as frequently used character classes, such as
\w, \W, \d, \s.


1. http://www.regexlib.com/

2. [abc] matches any single character in the set; {n} means the preceding
unit is matched exactly n times; {n, m} means the preceding unit is matched
at least n times, but not more than m times.


Algorithm 4: BuildGNFA

Input : A character-driven NFA A_c
Output : The GNFA A_g
1  Let p_s be the initial node pointer of A_c;
2  Create initial node V_0 for A_g and push V_0 into the node set S_V;
3  Initialize a stack S; // record created nodes of A_g
4  for each subpath P starting from p_s and representing a
   q-gram do
5      Let g be the q-gram represented by P, and P' be the
       subpath of P representing the (q-1)-length suffix of g;
6      Create a new node V for A_g with the associated path P',
       and a new transition from V_0 to V with g;
7      Push V into S_V and S;
8  while S is not empty do
9      V_i ← S.pop();
10     if V_i is not the final node of A_g then
11         Let P_i be the associated path of V_i;
12         Traverse A_c from the last node of P_i until the next
           character is found in different paths, record traversed
           paths by a set S_P;
13         for each subpath P_j in S_P do
14             P ← Concat the paths P_i and P_j;
15             Let g be the q-gram represented by P;
16             if P's last node is the final node of A_c then
17                 if S_V does not contain a final node then
18                     V ← create a final node for A_g;
19                     Push V into S_V;
20                 else
21                     V ← get the final node from S_V;
22             else
23                 Let P' be the subpath of P representing the
                   (q-1)-length suffix of g;
24                 if V ← S_V.find(P') finds no node then
25                     Create a new node V for A_g with P';
26                     Push V into S_V and S;
27             Create a new transition from V_i to V with g;
28 return A_g;


We manually classified the regexes into 6 categories. First,
we classified the queries according to whether they contain
repeating units. In particular, the queries containing the unit (.*) are
classified into one category (C3), since most algorithms incur a
high cost to match (.*). Then, the queries are further classified by
the number of simple paths in the GNFA (i.e., the paths without
any repeating unit). The details of the queries are shown in Table 2.


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2020.2992295, IEEE 














TABLE 2
Categories for the extracted regular expressions

                     C1     C2     C3     C4     C5       C6
Repeating unit       ✓      ✓             ×      ×        ×
# of simple paths    1-10   >10    NA     1-10   11-100   >100
# of queries         9      9      25     16     31       21


























In addition, we conducted experiments on the Human genome
and Protein sequences using two real query workloads, DNA
motif patterns [25] and PROSITE patterns [4], which respectively
consist of 574 and 505 patterns.

We compared the state-of-the-art algorithms GNUgrep (26], 
NRGrep [i], RE2 and N-Factor [2]. To be fair in comparison, 
the online algorithms (GNUgrep, NRGrep and RE2) were modified
so that they only verify a document starting from the occurrences
of the q-length prefixes of the queries, whose positions can be quickly
obtained from the positional q-gram inverted index. We also compared
the algorithms Extend and TreeMatch proposed in this paper,
which compiled the query plan once using the average gram list
size over all documents in a dataset.

In order to achieve a good ratio of speed to indexing overhead, 
the 5-gram inverted index was used for the DNA sequences. For 
the remaining datasets with a larger alphabet size, the 3-gram 
inverted index was used since there are too few distinct 2-grams 
and too many distinct 4-grams. 

All algorithms were implemented using C++. The experiments 
were run on a PC with an Intel 3.40GHz Quad Core CPU i7 and 
8GB memory, running an Ubuntu (Linux) 64-bit operating system. 
We first introduce the experiments on three document collections, 
then show the experiment using two real query workloads. 


7.2 Comparison with Alternative Algorithms 


Our first experiment compared the running time of our algorithms 
and the comparative algorithms. For each algorithm, we recorded 
the time of matching a query in all documents within a dataset, 
then averaged the running time of all queries in a category. 

We plotted the running time of the different algorithms in Fig. 13.
For all algorithms, the running time increases with the number of
simple paths of the queries, since they need more
time to answer queries with a complex structure. TreeMatch
spent the least running time on the different datasets and Extend
was the runner-up for most queries, e.g., for C5 on the web
page dataset, TreeMatch and Extend cost 201ms and 441ms,
against the time 1995ms, 2489ms, 3574ms, 2179ms and 2162ms
spent by GNUgrep, NRGrep, RE2 and N-Factor, respectively.
We observed that NRGrep was efficient for regexes with a simple
structure and became slow as the language size of the regex increases,
since it adopts a bit-parallel automaton for verification, which
has a high cost for complex queries. For example, NRGrep
was the runner-up for C4 on the three datasets, but its running
time was at least 7x that of TreeMatch for C5 and C6.

Besides, we also tested the automaton building time for each
algorithm. As Fig. 13 shows, all algorithms can build the
automata efficiently, within milliseconds. Since our proposed
methods need to build the GNFA from a Thompson NFA, they
spent more building time than GNUgrep, RE2, and N-Factor.
However, constructing the GNFA is well worthwhile, since our
methods spent much less overall matching time. Meanwhile, we
noticed that NRGrep spent more automaton building time than our
proposed methods for the queries in C1 and C3-C5. The reason is that
NRGrep also spent more time computing a bit-parallel automaton.


7.3 Evaluating GNFA-based Algorithms 


There are two phases to find occurrences from a dataset, which 
are query plan compiling and executing phases. Extend computes 
a GNFA from the query in the compiling phase, while TreeMatch 
needs to compute a GNFA and build a plan tree from the GNFA. 
For an in-depth analysis of our proposed algorithms, we separately 
tested the running time of these phases and annotate them by 
GNFA, TREE and EXEC. 

Extend and TreeMatch share the same procedure for
computing the GNFA, so their GNFA computing time was the
same, as shown in Fig. 14. We observed that the GNFA was computed
efficiently for all queries, e.g., the longest time used for computing
the GNFA was 4.38ms, for the queries in C6. Compared to Extend,
TreeMatch takes extra time to compute the plan tree, but less
time is spent executing the tree-based query plan. For example,
TreeMatch spent 0.92ms and 199ms to compute the plan tree and
execute it for C5 on web pages, compared to 440ms spent by
Extend to execute the GNFA-based query plan.


7.4 Effect of Optimization Techniques 


Earlier, we proposed two optimizations to reduce unnecessary 
checks of positional constraints. This experiment was designed 
to test the effects of these techniques. Extend was selected as 
the representative of the GNFA-based algorithms. Extend-exclusive 
and Extend-skip denote the algorithms applying these two 
techniques, respectively. Extend-all and Extend-none are the 
algorithms that use all and none of these optimizations. 

As the corresponding figure shows, Extend-none ran much slower 
without these optimizations. The gap was most obvious for C9 on 
the source code dataset, where Extend-all cost only 1514ms, 
compared to the 2568ms spent by Extend-none. We observed that 
skipping invalid starting positions affects the efficiency most, 
since the time of Extend-skip was closest to that of Extend-all 
for most of the queries, such as C3-C9 on web pages. 


7.5 Size of Gram-driven NFA 


In this experiment, we compared the size (number of transitions) 
of the GNFA with that of the Thompson NFA. All extracted regex 
queries were tested, and the results are shown in Fig. 16. The 
GNFA is smaller than the Thompson NFA for 83% of the queries 
(92 queries), since (1) a regex query may need fewer transitions 
with q-gram labels than transitions with character labels; and 
(2) the Thompson NFA contains ε-transitions, which are auxiliary 
transitions. 
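
To make point (2) concrete, the following minimal sketch counts the transitions of a Thompson NFA for a small regex. It assumes the textbook variant of the construction, in which concatenation merges two states and union and Kleene star each add four ε-transitions; other variants give different counts. The tuple encoding of the regex and the function name are hypothetical, and the GNFA itself is not constructed here.

```python
# Regex AST as nested tuples (hypothetical encoding used only in this sketch):
#   ("char", "a"), ("cat", left, right), ("alt", left, right), ("star", inner)

def thompson_transitions(node):
    """Return (labeled, epsilon) transition counts of the Thompson NFA."""
    kind = node[0]
    if kind == "char":
        return 1, 0                      # one transition labeled with the character
    if kind == "cat":
        l1, e1 = thompson_transitions(node[1])
        l2, e2 = thompson_transitions(node[2])
        return l1 + l2, e1 + e2          # states are merged, no new transition
    if kind == "alt":
        l1, e1 = thompson_transitions(node[1])
        l2, e2 = thompson_transitions(node[2])
        return l1 + l2, e1 + e2 + 4      # two epsilon fan-out, two epsilon fan-in
    if kind == "star":
        l, e = thompson_transitions(node[1])
        return l, e + 4                  # enter, skip, loop back, leave
    raise ValueError(f"unknown node kind: {kind}")


# (ab|cd)* under the assumptions above
ast = ("star", ("alt",
                ("cat", ("char", "a"), ("char", "b")),
                ("cat", ("char", "c"), ("char", "d"))))
labeled, epsilon = thompson_transitions(ast)
print(labeled, epsilon)
```

In this example the ε-transitions outnumber the character transitions, which illustrates why removing auxiliary transitions can shrink the automaton.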











[Plot for Fig. 16: legend: Thompson NFA, GNFA; x-axis: Regular expressions.] 


Fig. 16. Thompson NFA vs GNFA. 


7.6 Index Size and Construction Time 


We tested the index size and construction time of the q-gram 
positional inverted index for different document collections. The 
positional q-gram inverted lists were computed for each document, 
and all inverted lists used the uncompressed integer representation. 

We compared against the bit vector-based index BITINDEX used by 
N-Factor. For each dataset, the size of the inverted lists was 
almost 4x the size of the dataset, since there are |T| - q + 1 
grams for a document of size |T|. BITINDEX took more space since 
its space complexity is O(|T|·|Σ|/ws) (where ws is the word size 
in memory), and |Σ| for these datasets is 89, 93 and 68, 
respectively. 
For example, the sizes of inverted lists were 10.35GB, 16.77GB 




This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2020.2992295, IEEE 


Transactions on Knowledge and Data Engineering 











































[Plots for Fig. 13: panels (a) Web Pages, (b) Source Codes, (c) Wiki Sentences; legend: TreeMatch, Extend, GNUgrep, NRGrep, RE2; x-axis: Regular expressions; includes automaton building time.] 


Fig. 13. Performance comparison of different algorithms on different datasets. 
Fig. 17. Index size and construction time. 


and 47.07GB for the web pages, source codes and wiki sentences, 
whose dataset sizes were 2.09GB, 3.69GB and 10.43GB, while the 
sizes of BITINDEX were 20.8GB, 43.3GB and 88.6GB, respectively. 
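
The reported sizes can be checked with a back-of-the-envelope estimate. The sketch below rests on two assumptions that are not stated in this form in the text: each inverted-list entry is a 4-byte integer, and BITINDEX keeps one bit per (position, alphabet symbol) pair. The function names are hypothetical, and one character is assumed to take one byte.

```python
def inverted_lists_gb(dataset_gb, bytes_per_position=4):
    """About one position per character, each stored as an integer (assumed 4 bytes)."""
    return dataset_gb * bytes_per_position


def bitindex_gb(dataset_gb, alphabet_size):
    """Assumed layout: one bit per (position, symbol) pair, i.e. |T| * |Sigma| / 8 bytes."""
    return dataset_gb * alphabet_size / 8


# (dataset size in GB, alphabet size), as reported in the text
datasets = {
    "web pages": (2.09, 89),
    "source codes": (3.69, 93),
    "wiki sentences": (10.43, 68),
}

for name, (size_gb, sigma) in datasets.items():
    print(f"{name}: inverted lists ~{inverted_lists_gb(size_gb):.2f}GB, "
          f"BITINDEX ~{bitindex_gb(size_gb, sigma):.2f}GB")
```

Under these assumptions the BITINDEX estimates come out at roughly 23.3GB, 42.9GB and 88.7GB, in the same range as the reported sizes, while the inverted-list estimates are somewhat below the reported ones.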

Fig. 17 shows the index construction time, including the time 
for dataset loading, inverted list computing and index file 
saving. Because the positional inverted lists have the smaller 
index size, their construction time was also less than that of 
BITINDEX. 


7.7 Experiments on Real Regex Query Workload 


In the last experiment, we compared the performance of the 
compared algorithms using two real query workloads. TreeMatch was 
used as the representative of the GNFA-based algorithms. Two 
additional algorithms were compared: 1) TreeMatch-scan, which 
uses a scanning search for checking a positional constraint; 
2) a q-gram based method (the q-gram baseline), which enumerates 
all matching strings of a regex, then performs a positional 
intersection on the q-gram inverted lists for each matching string. 
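
The q-gram baseline can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes the matching strings of the regex have already been enumerated (so the regex language is finite) and that every matching string has at least q characters. All function names are hypothetical.

```python
from collections import defaultdict


def build_index(text, q):
    """Positional q-gram inverted index: gram -> sorted list of start positions."""
    index = defaultdict(list)
    for pos in range(len(text) - q + 1):
        index[text[pos:pos + q]].append(pos)
    return index


def find_string(index, s, q):
    """Start positions of s (len(s) >= q) via positional intersection."""
    if len(s) < q:
        raise ValueError("string shorter than q")
    # Grams at offsets 0, q, 2q, ... plus one overlapping gram covering the tail.
    offsets = list(range(0, len(s) - q + 1, q))
    if offsets[-1] != len(s) - q:
        offsets.append(len(s) - q)
    candidates = None
    for off in offsets:
        # A gram at position p with offset off implies a candidate start p - off.
        starts = {p - off for p in index.get(s[off:off + q], ())}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return []
    return sorted(candidates)


def baseline(index, matching_strings, q):
    """Union of (start, end) occurrences over the enumerated matching strings."""
    hits = set()
    for s in matching_strings:
        hits.update((p, p + len(s)) for p in find_string(index, s, q))
    return sorted(hits)


index = build_index("abcabdab", 2)
# The regex ab(c|d) has exactly two matching strings.
print(baseline(index, ["abc", "abd"], 2))
```

The cost of this baseline grows with the number of enumerated strings, which is consistent with its weaker results on queries with many wildcards.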


Fig. 18(a)-(c) show the results on the PROSITE query workload 
and Fig. 18(d) shows the query information. We can see that the 
number of paths in the GNFA increases with the query length. 
The number of matching results shows a different trend, however, 
since a query with more characters leads to longer matching 
strings, which reduces the query selectivity. In each figure, the 
number of queries in a category is also labeled on the X-axis. 


Fig. 18(a) shows the running time of the algorithms when varying 
the GNFA path number. Note that the GNFA of a PROSITE 
query could have a large number of distinct paths since it can 
contain many wildcards. As the path number increases, the 
running time of the different algorithms also increases. TreeMatch 
had the lowest running time when the path number is less than 10°, 
e.g., for the queries whose path number is in (1,10°], the running 
time of TreeMatch was only 7.72ms. Also, we found that TreeMatch 
cost almost the same time as TreeMatch-scan. Because only [...] 
true positive results for the queries with a high selectivity, for the 
queries which have fewer matching results, there are more paths 
in their GNFAs (Fig. 18(d)), so more time was used to prune the 
false positives from different paths. 

Fig. 18(e)-(g) show the experimental results on the Motif 
workload. From Fig. 18(g), we can see that a Motif query is 
simpler than a PROSITE query but has more occurrences. 
The number of paths decreased for query lengths in the range 
(31,42], because some Motif queries contain few (|) operators 
but many consecutive characters. This is also why the running 
time of the different algorithms decreases for query lengths in 
the largest range (31,42] in Fig. 18(f). 

In the last part, we compare the performance of the different 
algorithms in terms of the total running time over all queries. 
As shown in Fig. 19(a), the superiority of TreeMatch is more 
obvious from this view, e.g., TreeMatch spent only 33mins to 
answer all Motif queries, while N-Factor, NRGrep, the q-gram 
baseline, GNUgrep and RE2 spent 110mins, 307mins, 469mins, 
593mins and 743mins, respectively. Fig. 19(b) shows the speedups 
of TreeMatch over the other algorithms (i.e., time of the other 
algorithm / time of TreeMatch), which vary from 3.33x to 22.51x 
for the Motif workload. 





8 RELATED WORK 


(1) Online Regex Matching Algorithms 
The classical algorithms generally convert the regex query into 
an equivalent finite automaton (NFA/DFA), then run it from each 
position in the document to verify whether the substring is an 
occurrence of the regex query. An occurrence is reported whenever 
a final state of the automaton is reached [28], [29], [30]. 
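
The scanning strategy can be illustrated with a minimal sketch that simulates a hand-built, ε-free NFA from every start position. The dictionary encoding of the automaton and the function name are assumptions made for this illustration; the cited algorithms use compiled and far more optimized automata.

```python
def scan(nfa, start, finals, text):
    """Report every (begin, end) span accepted by an epsilon-free NFA."""
    matches = []
    for begin in range(len(text)):            # try each position of the document
        states = {start}
        for end in range(begin, len(text)):
            # Follow all transitions labeled with the current character.
            states = {t for s in states for t in nfa.get((s, text[end]), ())}
            if not states:
                break                         # no active state left
            if states & finals:
                matches.append((begin, end + 1))
    return matches


# Hand-built NFA for a(b|c)d: (state, character) -> set of next states
nfa = {
    (0, "a"): {1},
    (1, "b"): {2},
    (1, "c"): {2},
    (2, "d"): {3},
}
print(scan(nfa, 0, {3}, "xabdacd"))
```

Every position of the document is visited at least once, which is the cost that the two-phase and index-based methods try to avoid.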

In order to avoid checking every position of the document, 
an alternative is the two-phase matching strategy, which first 
locates all the candidate regions in the document with multiple 
string matching algorithms, then uses the classical regex 
matching algorithms to verify them [10], [11], [26]. GNU 
grep [26] uses necessary factors, i.e., substrings of the regex 
query that every match must contain, to locate the candidate 
positions. Watson introduces the algorithm MultiStringRE, which 
uses the prefixes of the regex to locate the candidate positions, 
then verifies them by a Commentz-Walter-like algorithm. NRGrep is 
an extension of the BNDM algorithm [11], which uses a modified 
DFA to match every reversed prefix of the regex. 
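
A minimal sketch of the two-phase idea is given below. It assumes the caller supplies a literal string that is a prefix of every match (the factor is not derived from the regex here), and it uses Python's `re` module as a stand-in for the verification step, although `re` is a backtracking engine rather than one of the automaton-based verifiers discussed above. The function name is hypothetical.

```python
import re


def two_phase_search(pattern, prefix, text):
    """Locate candidates with a literal prefix, then verify each with a regex engine."""
    regex = re.compile(pattern)
    matches = []
    pos = text.find(prefix)                  # phase 1: plain string search
    while pos != -1:
        m = regex.match(text, pos)           # phase 2: verify at the candidate only
        if m:
            matches.append((m.start(), m.end()))
        pos = text.find(prefix, pos + 1)
    return matches


text = "xxab12cd ab cd ab7cd"
print(two_phase_search(r"ab[0-9]+cd", "ab", text))
```

Only the three positions where "ab" occurs are verified, but each verification still scans the text, which becomes expensive when many candidates survive the first phase.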


(2) Index-based Regex Matching Algorithms 


Yang et al. propose the N-Factor technique to accelerate 
regex matching, which also adopts the two-phase strategy. An 
N-Factor is essentially a substring that cannot appear in any 
matching result of a query. This work uses the N-Factors of a 
query to further prune the candidate regions computed through the 
prefixes and suffixes of the query, by checking whether a 
candidate region contains any N-Factor. N-Factor uses BITINDEX 
to support efficient bit-parallel filtering algorithms. However, 
N-Factor provides little filtering power for complex regex queries 
(e.g., those containing wildcards or character sets), and BITINDEX 
is large for datasets with a large alphabet. 

Some works aim to find the matching results of a single 
regex query in document collections [2], [13], [14]. They 
generally use inverted lists to find candidate documents that 
contain enough substrings of Q, then invoke automaton-based 
matching algorithms for verification. Besides, recent works also 
study the problem of multiple regex matching, which aims to 
efficiently answer multiple regex queries on a given document 
[8], [9], [32]. They are orthogonal to the problem studied in 
this paper. 


(3) Multi-list Intersection Algorithms 

Multi-list intersection has been well studied in the community. 
The classical SvS algorithm [20] reduces this problem to a 
sequence of pairwise intersections. Search-based algorithms 
[20], [33] set one list as a pivot and use an efficient search 
to find intersection candidates in the remaining lists. The 
hash-based approach accelerates the intersection by building the 
hash structures off-line. In this paper, the regex occurrences 
are computed through a multi-list intersection based technique, 
so the search-based techniques can also be applied to regex 
matching by modifying the strategy of checking positional 
constraints with q-gram inverted lists. 
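
As a minimal sketch of these ideas, the code below combines an SvS-style ordering (shortest list first) with a binary search of each pivot element in the longer list. It works on plain sorted integer lists and does not model the positional offsets needed for q-gram constraints. The function names are hypothetical.

```python
from bisect import bisect_left


def intersect_pair(short, long_):
    """Intersect two sorted lists by searching the longer one for each element."""
    result, lo = [], 0
    for x in short:
        lo = bisect_left(long_, x, lo)       # resume where the last search ended
        if lo == len(long_):
            break
        if long_[lo] == x:
            result.append(x)
    return result


def svs(lists):
    """Small-versus-small: pairwise intersections, starting from the shortest list."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_pair(result, other)
        if not result:
            break                            # the intersection is already empty
    return result


print(svs([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 9, 12]]))
```

Starting from the shortest list keeps the intermediate results small, so later searches run over fewer pivot elements.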


(4) Algorithms Using Positional Inverted Index 

The positional inverted index is widely used in many 
applications. In the area of string similarity search, the 
positional q-gram inverted index is used to identify candidates 
from a long string for a similarity query [36], [37]. Besides, 
inverted indexes with position information are widely used for 
phrase querying in information retrieval [39]. 


9 CONCLUSION 


In this paper, we study the efficient regular expression matching 
problem. We propose a gram-driven NFA (GNFA) to represent the 
language of a regex, which opens the opportunity to design 
efficient positional index-based query plans. Based on the GNFA, 
we propose GNFA-based query plans that answer the query purely 
based on the positional inverted index. Experiments show that our 
method outperforms the state-of-the-art methods. 

Our approach is designed for the scenario where a server needs 
to process many queries on a static dataset. Nevertheless, we note 
that there is an analogy with typical full-text search systems. 
To accommodate insertion updates (arguably the most common 
type of update workload), one can build a semi-dynamic version 
of the index, in which the query cost has an additional log n 
factor [40]. We can use the same idea to handle insertion-only 
update workloads. 
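
One standard way to obtain a semi-dynamic structure with a logarithmic query overhead is the logarithmic method, which keeps a small number of static structures whose sizes are powers of two. Whether [40] uses exactly this scheme is an assumption. The sketch below applies the idea to a plain sorted array standing in for a static index; the class and method names are hypothetical.

```python
from bisect import bisect_left


class SemiDynamicSet:
    """Insertion-only set built from static sorted arrays of sizes 2^i."""

    def __init__(self):
        self.levels = []                     # levels[i] is empty or holds 2**i keys

    def insert(self, key):
        carry = [key]
        for i, level in enumerate(self.levels):
            if not level:
                self.levels[i] = carry       # free slot: store the rebuilt structure
                return
            # Rebuild from scratch; this stands in for a static index build.
            carry = sorted(level + carry)
            self.levels[i] = []
        self.levels.append(carry)

    def contains(self, key):
        # A query visits every level, i.e. about log2(n) static structures.
        for level in self.levels:
            i = bisect_left(level, key)
            if i < len(level) and level[i] == key:
                return True
        return False


s = SemiDynamicSet()
for k in [5, 3, 8, 1, 9]:
    s.insert(k)
print([len(level) for level in s.levels], s.contains(8), s.contains(4))
```

Each key takes part in at most a logarithmic number of rebuilds, and each query consults every non-empty level, which is where the extra log n factor comes from in this scheme.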